Introduction

This is the fifth installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.

In this notebook, I demonstrate k-means clustering using the Digit Recognizer competition.

Outline

  0. Functions to process the data
  1. Import and examine the data
  2. Cluster data into 10 categories (one for each digit)
  3. Evaluate model results
  4. Summary

Import Necessary Modules


In [17]:
import pandas as pd
import numpy as np
import sklearn.cluster as skc
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

0. Functions to Process Data


In [5]:
def ij2index(ii,jj):
    """
    Converts pixel indices ii (row) and jj (column)
    to a single value in the grid below:
    
         jj=0 jj=1 jj=2 jj=3   jj=26 jj=27
    ii=0  000  001  002  003 ... 026  027
    ii=1  028  029  030  031 ... 054  055
    ii=2  056  057  058  059 ... 082  083
           |    |    |    |  ...  |    |
    ii=26 728  729  730  731 ... 754  755
    ii=27 756  757  758  759 ... 782  783
    """
    
    # Number of columns per row
    nJ = 28
    return ii*nJ + jj
    
def index2ij(index):
    """
    Converts 1D index to 2D pixel indices 
    ii (row) and jj (column) from the grid below:
    
         jj=0 jj=1 jj=2 jj=3   jj=26 jj=27
    ii=0  000  001  002  003 ... 026  027
    ii=1  028  029  030  031 ... 054  055
    ii=2  056  057  058  059 ... 082  083
           |    |    |    |  ...  |    |
    ii=26 728  729  730  731 ... 754  755
    ii=27 756  757  758  759 ... 782  783
    """
    
    # Number of columns per row
    nJ = 28
    jj = index % nJ
    ii = (index - jj) // nJ
    return (ii, jj)

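A quick sanity check (my addition, not in the original notebook) confirms that the two helpers are mutual inverses:

In [ ]:
# Round-trip a few corner and interior indices through both helpers
for index in (0, 27, 28, 755, 783):
    assert ij2index(*index2ij(index)) == index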

1. Read Digit Data


In [10]:
data = pd.read_csv("./data/digits/train.csv")
data.head()


Out[10]:
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns


In [11]:
# Split up digit images and labels
targets = data['label']
digits = data.drop('label',axis=1)

In [16]:
# Plot one of the digits
plt.imshow(digits.loc[1000].values.reshape(28,28),cmap=cm.Greys,interpolation='none')


Out[16]:
<matplotlib.image.AxesImage at 0x11f70f990>
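To confirm that the image matches its label (my addition), the corresponding entry of targets can be printed alongside:

In [ ]:
# The true label for the digit plotted above
print targets.loc[1000]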

In [60]:
# Plot frequency of digits in the dataset
targets.hist()


Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x10862db10>

2. Cluster data into 10 categories

We use k-means to cluster the available greyscale images into 10 categories. I do not anticipate a clean correspondence between the resulting clusters and the 10 digits (0-9), because our method is neither scale-, translation-, nor rotation-invariant.

Nevertheless, let's try and see how it goes!
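For intuition, here is a minimal NumPy sketch (my illustration, not part of the original analysis) of the Lloyd iteration that k-means alternates: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The actual clustering below uses scikit-learn's optimized implementation.

In [ ]:
def lloyd_kmeans(X, k, n_iter=20, seed=1):
    """Toy k-means (Lloyd's algorithm) for a 2D numpy array X."""
    rng = np.random.RandomState(seed)
    # Initialize centroids from k randomly chosen rows of X
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared distance from every row to every centroid
        d2 = ((X**2).sum(axis=1)[:, None]
              - 2*np.dot(X, centers.T)
              + (centers**2).sum(axis=1)[None, :])
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned rows
        for jj in range(k):
            if np.any(labels == jj):
                centers[jj] = X[labels == jj].mean(axis=0)
    return centers, labels

# Example call (equivalent in spirit to the scikit-learn fit below):
# centers, labels = lloyd_kmeans(digits.values, 10)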


In [55]:
model = skc.KMeans(n_clusters=10,n_init=1,random_state=1)

In [56]:
model.fit(digits)


Out[56]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=10, n_init=1,
    n_jobs=1, precompute_distances=True, random_state=1, tol=0.0001,
    verbose=0)

In [68]:
output = model.predict(digits)

3. Evaluate Model Results


In [69]:
# Plot the center of each cluster returned from the k-means algorithm
for ii in range(10):
    plt.subplot(2,5,ii+1)
    plt.imshow(model.cluster_centers_[ii,:].reshape(28,28),cmap=cm.Greys,interpolation='none')
    plt.title('ii = {}'.format(ii))


This is better than I expected! Several of the centroids clearly correspond to recognizable digits. Some deficiencies remain: no centroid corresponds to the digit "5", two centroids correspond to the digit "0", and the centroids for "4", "7", and "9" look strongly similar.


In [70]:
# Plot the number of images assigned to each cluster
output = model.predict(digits)
height,left = np.histogram(output,bins=np.arange(11))
plt.bar(left[:-1],height)


Out[70]:
<Container object of 10 artists>

This histogram of cluster assignments reveals more shortcomings. Clusters 5 and 7, which both correspond to the digit "0", contain relatively few images. Cluster 1, on the other hand, seems to correspond to the digit "1", but comparing with the digit-frequency histogram above shows that many other digits must have been placed into that cluster.
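
For exact per-cluster counts (a small addition of mine), np.bincount is more direct than a binned histogram:

In [ ]:
# Exact number of images assigned to each of the 10 clusters
print np.bincount(output, minlength=10)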


In [83]:
# Visually associate the cluster centers with a digit from 0-9
def real2centroid(number):
    """Return the model centroid index associated with the real digit value."""
    # Note: no centroid resembles "5", so its entry here is only a placeholder
    realvalues = [7,1,0,6,2,5,8,4,9,3]
    return realvalues[number]
    
def centroid2real(number):
    """Return the real digit associated with a given centroid index."""
    # Centroids 5 and 7 both resemble "0"; no centroid resembles "5"
    centroids = [2,1,4,9,7,0,3,0,6,8]
    return centroids[number]
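
The tables above were filled in by eye from the centroid plots. As a cross-check (my addition, not in the original notebook), the same mapping can be recovered programmatically by assigning each cluster the most common true label among its members:

In [ ]:
# Most frequent true digit within each cluster (majority vote)
mapping = pd.crosstab(output, targets).idxmax(axis=1)
print mapping  # compare against the hand-coded centroid2real table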

In [81]:
# Convert the model output to predicted digit labels
output = model.predict(digits)
output = map(centroid2real,output)

In [82]:
# Calculate fraction correct
print 1-(sum((output-targets)!=0))/float(len(output))


0.595428571429
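
As a supplementary check (my addition), scikit-learn also provides label-invariant clustering metrics such as the adjusted Rand index, which scores the agreement between clusters and true labels without any manual mapping:

In [ ]:
import sklearn.metrics as skm
# 1.0 = perfect agreement with the true labels, ~0.0 = chance level
print skm.adjusted_rand_score(targets, model.predict(digits))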

4. Summary

A very simple k-means run placed handwritten digits into 10 clusters without using any labels. In many cases, these clusters clearly corresponded to actual digits. After visualizing the cluster centers and assigning each one a digit from 0-9, we found that the algorithm correctly categorizes roughly 60% of the images. That's rather impressive for a straightforward application of an unsupervised learning algorithm!